{
"cells": [
{
"cell_type": "markdown",
"id": "dd7a2a4c-db46-4f2b-814f-c51a9d7f90fe",
"metadata": {},
"source": [
"### Getting to grips with linear models in Python - Exercises & Answers"
]
},
{
"cell_type": "markdown",
"id": "9e633e37-cd01-4863-af66-2c1fac696ad8",
"metadata": {},
"source": [
"## 1. Basic statistics with `pingouin`\n",
"First, lets get comfortable with some basic statistics in Python using `pingouin`, so we're on familiar ground before we dive into the GLM.\n",
"\n"
]
},
{
"cell_type": "markdown",
"id": "1abe6991-38d5-4d4e-8efa-5ac384e0f57b",
"metadata": {},
"source": [
"### a. Importing everything we need\n",
"You will need a few things to answer the exercises, so its good practice to import most things you will need here.\n",
"- Import `pandas`, `pingouin`, `statsmodels.formula.api` and `seaborn`."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "3c9f0bdf-66ed-4d38-971c-2f174d4fe2e6",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"# Your answer here\n",
"# Import libraries with their usual imports\n",
"import pandas as pd\n",
"import pingouin as pg\n",
"import statsmodels.formula.api as smf\n",
"import seaborn as sns\n",
"\n",
"sns.set_style('whitegrid')"
]
},
{
"cell_type": "markdown",
"id": "ced7ae25-efee-406d-88e1-0ffb776d7dcc",
"metadata": {},
"source": [
"### b. Loading some data\n",
"We will apply the usual statistical approaches with an interesting dataset called `affairs`. This dataset comprises a survey of married people who report the number of affairs they have had, along with several other attributes. \n",
"\n",
"The variables are summarised here, and more detail can be found [here](https://vincentarelbundock.github.io/Rdatasets/doc/AER/Affairs.html)\n",
"- `affairs` - the number of instances of extramarital sex in the last year.\n",
"- `gender`\n",
"- `age` \n",
"- `yearsmarried`\n",
"- `children` - yes/no\n",
"- `religiousness` - degree of religiosity, higher numbers being more religious.\n",
"- `edudcation` - codes level of education (9 = grade school, 20 = PhD, and more in between).\n",
"- `occupation` - occupation coding according to a classification system called the Hollingshead.\n",
"- `rating` - self report of marital happiness (1 = very unhappy, 5 = very happy).\n",
"\n",
"We'll use some basic approaches to test some hypotheses in this dataset. First, read it into a dataframe called `affairs` from the following link, and examine the top 5 rows:\n",
"\n",
"https://vincentarelbundock.github.io/Rdatasets/csv/AER/Affairs.csv"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "c77c8d98-a241-4ee6-a648-858587a8f9eb",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": [
"hide-input"
]
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
rownames
\n",
"
affairs
\n",
"
gender
\n",
"
age
\n",
"
yearsmarried
\n",
"
children
\n",
"
religiousness
\n",
"
education
\n",
"
occupation
\n",
"
rating
\n",
"
\n",
" \n",
" \n",
"
\n",
"
0
\n",
"
4
\n",
"
0
\n",
"
male
\n",
"
37.0
\n",
"
10.00
\n",
"
no
\n",
"
3
\n",
"
18
\n",
"
7
\n",
"
4
\n",
"
\n",
"
\n",
"
1
\n",
"
5
\n",
"
0
\n",
"
female
\n",
"
27.0
\n",
"
4.00
\n",
"
no
\n",
"
4
\n",
"
14
\n",
"
6
\n",
"
4
\n",
"
\n",
"
\n",
"
2
\n",
"
11
\n",
"
0
\n",
"
female
\n",
"
32.0
\n",
"
15.00
\n",
"
yes
\n",
"
1
\n",
"
12
\n",
"
1
\n",
"
4
\n",
"
\n",
"
\n",
"
3
\n",
"
16
\n",
"
0
\n",
"
male
\n",
"
57.0
\n",
"
15.00
\n",
"
yes
\n",
"
5
\n",
"
18
\n",
"
6
\n",
"
5
\n",
"
\n",
"
\n",
"
4
\n",
"
23
\n",
"
0
\n",
"
male
\n",
"
22.0
\n",
"
0.75
\n",
"
no
\n",
"
2
\n",
"
17
\n",
"
6
\n",
"
3
\n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" rownames affairs gender age yearsmarried children religiousness \\\n",
"0 4 0 male 37.0 10.00 no 3 \n",
"1 5 0 female 27.0 4.00 no 4 \n",
"2 11 0 female 32.0 15.00 yes 1 \n",
"3 16 0 male 57.0 15.00 yes 5 \n",
"4 23 0 male 22.0 0.75 no 2 \n",
"\n",
" education occupation rating \n",
"0 18 7 4 \n",
"1 14 6 4 \n",
"2 12 1 4 \n",
"3 18 6 5 \n",
"4 17 6 3 "
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Your answer here\n",
"# Read in data show top 5\n",
"affairs = pd.read_csv('https://vincentarelbundock.github.io/Rdatasets/csv/AER/Affairs.csv')\n",
"affairs.head()"
]
},
{
"cell_type": "markdown",
"id": "13f6773c-633d-4efc-b4ff-c8bd5e1e4fc9",
"metadata": {},
"source": [
"### c. Using correlation to test relationships\n",
"Conduct a correlation to test whether there is an association between self-reported marital happiness is associated with number of affairs. What is your prediction on the direction? \n",
"\n",
"Conduct a correlation with `pingouin`."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "8190ab0a-73fd-4b11-a28f-b96310c6256e",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": [
"hide-input"
]
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
n
\n",
"
r
\n",
"
CI95%
\n",
"
p-val
\n",
"
BF10
\n",
"
power
\n",
"
\n",
" \n",
" \n",
"
\n",
"
pearson
\n",
"
601
\n",
"
-0.279512
\n",
"
[-0.35, -0.2]
\n",
"
3.002385e-12
\n",
"
1.796e+09
\n",
"
1.0
\n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" n r CI95% p-val BF10 power\n",
"pearson 601 -0.279512 [-0.35, -0.2] 3.002385e-12 1.796e+09 1.0"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Your answer here\n",
"# Correlation\n",
"cor = pg.corr(affairs['affairs'], affairs['rating'])\n",
"display(cor)"
]
},
{
"cell_type": "markdown",
"id": "9de47229-5065-4f9f-8012-4e0fe11649e2",
"metadata": {},
"source": [
"### d. Using t-tests to examine differences\n",
"Does the presence of children impact marital satisfaction? Use a t-test in `pingouin` to examine whether satisfaction differs between marriages with and without children.\n",
"\n",
"If it is significant, calculate the averages between the groups to see what the differences are. Use seaborn to make a plot of the ratings between the two groups."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "d5c9acb5-b38d-429c-a448-77f22126c2b9",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": [
"hide-input"
]
},
"outputs": [
{
"data": {
"text/html": [
"
"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Your answer here\n",
"# T-test\n",
"display(pg.ttest(affairs.query('children == \"yes\"')['rating'], \n",
" affairs.query('children == \"no\"')['rating']))\n",
"\n",
"# Means\n",
"means = affairs.groupby(by=['children']).agg({'rating': 'mean'})\n",
"display(means)\n",
"\n",
"# Plot with a barchart but other options might work\n",
"sns.barplot(data=affairs, x='children', y='rating')"
]
},
{
"cell_type": "markdown",
"id": "a3443c97-1941-46cf-b62e-3086ae400b75",
"metadata": {},
"source": [
"### e. Using ANOVA to test slightly more complicated relationships\n",
"Does having children affect marital satisfaction differently for males and females?\n",
"\n",
"To answer this, you need an 'two-way' ANOVA (eg - its a linear model with TWO predictors, each being categorical). Use ANOVA to test these differences with `pingouin`, and if you observe a difference, work out the means with `pandas` and again make a plot with `seaborn`."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "b5e00597-103a-43af-b9eb-366b5474913b",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": [
"hide-input"
]
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
Source
\n",
"
SS
\n",
"
DF
\n",
"
MS
\n",
"
F
\n",
"
p-unc
\n",
"
np2
\n",
"
\n",
" \n",
" \n",
"
\n",
"
0
\n",
"
children
\n",
"
28.116062
\n",
"
1.0
\n",
"
28.116062
\n",
"
24.112908
\n",
"
0.000001
\n",
"
0.038822
\n",
"
\n",
"
\n",
"
1
\n",
"
gender
\n",
"
0.026971
\n",
"
1.0
\n",
"
0.026971
\n",
"
0.023131
\n",
"
0.879169
\n",
"
0.000039
\n",
"
\n",
"
\n",
"
2
\n",
"
children * gender
\n",
"
5.933417
\n",
"
1.0
\n",
"
5.933417
\n",
"
5.088620
\n",
"
0.024444
\n",
"
0.008452
\n",
"
\n",
"
\n",
"
3
\n",
"
Residual
\n",
"
696.112181
\n",
"
597.0
\n",
"
1.166017
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Source SS DF MS F p-unc \\\n",
"0 children 28.116062 1.0 28.116062 24.112908 0.000001 \n",
"1 gender 0.026971 1.0 0.026971 0.023131 0.879169 \n",
"2 children * gender 5.933417 1.0 5.933417 5.088620 0.024444 \n",
"3 Residual 696.112181 597.0 1.166017 NaN NaN \n",
"\n",
" np2 \n",
"0 0.038822 \n",
"1 0.000039 \n",
"2 0.008452 \n",
"3 NaN "
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"
"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Your answer here\n",
"# Carry out anova\n",
"aov = pg.anova(data=affairs, dv='rating', between=['children', 'gender'])\n",
"display(aov)\n",
"\n",
"# Compute means, the interaction is significant\n",
"means = affairs.groupby(by=['children', 'gender']).agg({'rating': 'mean'})\n",
"display(means)\n",
"\n",
"# Plot\n",
"sns.barplot(data=affairs, x='gender', y='rating', hue='children');"
]
},
{
"cell_type": "markdown",
"id": "b02bab8f-69de-44c5-b7e2-446a7c94c45f",
"metadata": {},
"source": [
"### e. Carrying out an analysis of covariance (ANCOVA).\n",
"A bit of a challenge. A popular statistical approach is the ANCOVA, which carries out an ANOVA while 'controlling' for another continuous variable, or a 'covariate'. This is simply confusing language for a general linear model with categorical predictors and a continuous predictor, which we will examine in more detail soon. \n",
"\n",
"Here, can you carry out an `ancova` with `pingouin`, assessing whether males and differ in the number of affairs they have, controlling for the years they have been married? Check the `pingouin` website [here](https://pingouin-stats.org/build/html/api.html) for help - reading documentation is a key skill in statistics and programming! Set the `effsize` (effect size) to 'n2', or standard $R^2$.\n",
"\n",
"\n",
"As an additional challenge, if there are any significant results, can you plot the relationship between years married and affairs, and colour the points by male vs female respondents?"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "9bdf87d2-3b38-442c-a009-5d2c51e895d3",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": [
"hide-input"
]
},
"outputs": [
{
"data": {
"text/html": [
"
"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Your answer here\n",
"# ANCOVA like this\n",
"aocv = pg.ancova(data=affairs, dv='affairs', between='gender', covar=['yearsmarried'], effsize='n2')\n",
"display(aocv)\n",
"\n",
"# Plot like this\n",
"sns.scatterplot(data=affairs, x='yearsmarried', y='affairs', hue='gender');"
]
},
{
"cell_type": "markdown",
"id": "d0ff650d-9866-49d7-98a2-06b5064b1309",
"metadata": {},
"source": [
"## 2. The general linear model with `statsmodels`\n",
"Now we will get used to applying the general linear model to fit more complex analyses, repeat some old ones, and get to grips with the attributes of the model and how we can manipulate it."
]
},
{
"cell_type": "markdown",
"id": "8a679880-ceed-430a-b926-2ea8fef3291e",
"metadata": {},
"source": [
"### a. Fitting an actual linear model\n",
"We will start by using `statsmodels` to conduct a regression.\n",
"\n",
"Suppose we have a theory that marital satisfaction is determined by the contribution of several predictors, that each influence how satisfied someone is in their marriage. Using the GLM, we will aim to predict marital satisfaction from the respondents gender, whether they have children (both categorical variables!), degree of religiosity, and the number of years they have been married. Notice we are already including a mix of variables - there is no restriction on variable types.\n",
"\n",
"Fit this model below, and have Python show the summary."
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "c0ed547f-56af-451f-a668-3695fa9ae961",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": [
"hide-input"
]
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"
OLS Regression Results
\n",
"
\n",
"
Dep. Variable:
rating
R-squared:
0.070
\n",
"
\n",
"
\n",
"
Model:
OLS
Adj. R-squared:
0.064
\n",
"
\n",
"
\n",
"
No. Observations:
601
F-statistic:
11.27
\n",
"
\n",
"
\n",
"
Covariance Type:
nonrobust
Prob (F-statistic):
8.07e-09
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
coef
std err
t
P>|t|
[0.025
0.975]
\n",
"
\n",
"
\n",
"
Intercept
4.1801
0.141
29.650
0.000
3.903
4.457
\n",
"
\n",
"
\n",
"
gender[T.male]
0.0093
0.087
0.106
0.916
-0.162
0.181
\n",
"
\n",
"
\n",
"
children[T.yes]
-0.2094
0.118
-1.775
0.076
-0.441
0.022
\n",
"
\n",
"
\n",
"
religiousness
0.0771
0.038
2.017
0.044
0.002
0.152
\n",
"
\n",
"
\n",
"
yearsmarried
-0.0420
0.010
-4.329
0.000
-0.061
-0.023
\n",
"
\n",
"
Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified."
],
"text/latex": [
"\\begin{center}\n",
"\\begin{tabular}{lclc}\n",
"\\toprule\n",
"\\textbf{Dep. Variable:} & rating & \\textbf{ R-squared: } & 0.070 \\\\\n",
"\\textbf{Model:} & OLS & \\textbf{ Adj. R-squared: } & 0.064 \\\\\n",
"\\textbf{No. Observations:} & 601 & \\textbf{ F-statistic: } & 11.27 \\\\\n",
"\\textbf{Covariance Type:} & nonrobust & \\textbf{ Prob (F-statistic):} & 8.07e-09 \\\\\n",
"\\bottomrule\n",
"\\end{tabular}\n",
"\\begin{tabular}{lcccccc}\n",
" & \\textbf{coef} & \\textbf{std err} & \\textbf{t} & \\textbf{P$> |$t$|$} & \\textbf{[0.025} & \\textbf{0.975]} \\\\\n",
"\\midrule\n",
"\\textbf{Intercept} & 4.1801 & 0.141 & 29.650 & 0.000 & 3.903 & 4.457 \\\\\n",
"\\textbf{gender[T.male]} & 0.0093 & 0.087 & 0.106 & 0.916 & -0.162 & 0.181 \\\\\n",
"\\textbf{children[T.yes]} & -0.2094 & 0.118 & -1.775 & 0.076 & -0.441 & 0.022 \\\\\n",
"\\textbf{religiousness} & 0.0771 & 0.038 & 2.017 & 0.044 & 0.002 & 0.152 \\\\\n",
"\\textbf{yearsmarried} & -0.0420 & 0.010 & -4.329 & 0.000 & -0.061 & -0.023 \\\\\n",
"\\bottomrule\n",
"\\end{tabular}\n",
"%\\caption{OLS Regression Results}\n",
"\\end{center}\n",
"\n",
"Notes: \\newline\n",
" [1] Standard Errors assume that the covariance matrix of the errors is correctly specified."
],
"text/plain": [
"\n",
"\"\"\"\n",
" OLS Regression Results \n",
"==============================================================================\n",
"Dep. Variable: rating R-squared: 0.070\n",
"Model: OLS Adj. R-squared: 0.064\n",
"No. Observations: 601 F-statistic: 11.27\n",
"Covariance Type: nonrobust Prob (F-statistic): 8.07e-09\n",
"===================================================================================\n",
" coef std err t P>|t| [0.025 0.975]\n",
"-----------------------------------------------------------------------------------\n",
"Intercept 4.1801 0.141 29.650 0.000 3.903 4.457\n",
"gender[T.male] 0.0093 0.087 0.106 0.916 -0.162 0.181\n",
"children[T.yes] -0.2094 0.118 -1.775 0.076 -0.441 0.022\n",
"religiousness 0.0771 0.038 2.017 0.044 0.002 0.152\n",
"yearsmarried -0.0420 0.010 -4.329 0.000 -0.061 -0.023\n",
"===================================================================================\n",
"\n",
"Notes:\n",
"[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n",
"\"\"\""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Your answer here\n",
"# Fit the model like so\n",
"model = smf.ols('rating ~ gender + children + religiousness + yearsmarried', data=affairs).fit()\n",
"display(model.summary(slim=True))"
]
},
{
"cell_type": "markdown",
"id": "8f1f0cd7-758e-4d54-9796-9726b1bccd30",
"metadata": {},
"source": [
"### b. Interpreting the coefficients\n",
"Now for the tricky part. What does each coefficient mean?\n",
"\n",
"- What is the definition of the intercept?\n",
"- What is the definition of the `gender[T.male]` coefficient?\n",
"- What is the definition of the `children[T.yes]` coefficient?\n",
"- What is the definition of the `religiousness` coefficient?\n",
"- What is the definition of the `yearsmarried` coefficient?\n",
"\n",
"Do any of these make sense in their current definition?"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "c9cb96cd-c2d5-4d6c-ba73-881ddefb25e5",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"# Your answer here\n",
"# Intercept - average satisfaction when all other predictors at zero - which means for females with no children!\n",
"# Gender - change in for males compared to females, holding all constant\n",
"# Children - change in satisfaction for someone with children compared to without, holding all constant\n",
"# Religiousness - Increase in satisfaction for a 1 step up the scale on religiosity, holding all constant\n",
"# Years Married - Increase in satisfaction with every additional unit of years married (one year)"
]
},
{
"cell_type": "markdown",
"id": "a1894e92-5f90-418f-8735-380006b9e89f",
"metadata": {},
"source": [
"### c. Altering the model for interpretability\n",
"The previous model suggests some interesting relationships, but it is a little difficult to interpret it. For example, religiousness does not have a clear meaning (it ranges 1, 2, 3... etc), and years married has some slight difficulty in interpretation. It may be better to scale those two variables, so that the intercept makes sense. \n",
"\n",
"Use the formula syntax to scale the `religiousness` and `yearsmarried` predictors, and re-fit the model. Why would we not scale the `gender` and `children` predictors?"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "861a6436-2558-423a-9a96-d9852c8a473f",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": [
"hide-input"
]
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"
OLS Regression Results
\n",
"
\n",
"
Dep. Variable:
rating
R-squared:
0.070
\n",
"
\n",
"
\n",
"
Model:
OLS
Adj. R-squared:
0.064
\n",
"
\n",
"
\n",
"
No. Observations:
601
F-statistic:
11.27
\n",
"
\n",
"
\n",
"
Covariance Type:
nonrobust
Prob (F-statistic):
8.07e-09
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
coef
std err
t
P>|t|
[0.025
0.975]
\n",
"
\n",
"
\n",
"
Intercept
4.0772
0.102
40.168
0.000
3.878
4.277
\n",
"
\n",
"
\n",
"
gender[T.male]
0.0093
0.087
0.106
0.916
-0.162
0.181
\n",
"
\n",
"
\n",
"
children[T.yes]
-0.2094
0.118
-1.775
0.076
-0.441
0.022
\n",
"
\n",
"
\n",
"
scale(religiousness)
0.0900
0.045
2.017
0.044
0.002
0.178
\n",
"
\n",
"
\n",
"
scale(yearsmarried)
-0.2336
0.054
-4.329
0.000
-0.340
-0.128
\n",
"
\n",
"
Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified."
],
"text/latex": [
"\\begin{center}\n",
"\\begin{tabular}{lclc}\n",
"\\toprule\n",
"\\textbf{Dep. Variable:} & rating & \\textbf{ R-squared: } & 0.070 \\\\\n",
"\\textbf{Model:} & OLS & \\textbf{ Adj. R-squared: } & 0.064 \\\\\n",
"\\textbf{No. Observations:} & 601 & \\textbf{ F-statistic: } & 11.27 \\\\\n",
"\\textbf{Covariance Type:} & nonrobust & \\textbf{ Prob (F-statistic):} & 8.07e-09 \\\\\n",
"\\bottomrule\n",
"\\end{tabular}\n",
"\\begin{tabular}{lcccccc}\n",
" & \\textbf{coef} & \\textbf{std err} & \\textbf{t} & \\textbf{P$> |$t$|$} & \\textbf{[0.025} & \\textbf{0.975]} \\\\\n",
"\\midrule\n",
"\\textbf{Intercept} & 4.0772 & 0.102 & 40.168 & 0.000 & 3.878 & 4.277 \\\\\n",
"\\textbf{gender[T.male]} & 0.0093 & 0.087 & 0.106 & 0.916 & -0.162 & 0.181 \\\\\n",
"\\textbf{children[T.yes]} & -0.2094 & 0.118 & -1.775 & 0.076 & -0.441 & 0.022 \\\\\n",
"\\textbf{scale(religiousness)} & 0.0900 & 0.045 & 2.017 & 0.044 & 0.002 & 0.178 \\\\\n",
"\\textbf{scale(yearsmarried)} & -0.2336 & 0.054 & -4.329 & 0.000 & -0.340 & -0.128 \\\\\n",
"\\bottomrule\n",
"\\end{tabular}\n",
"%\\caption{OLS Regression Results}\n",
"\\end{center}\n",
"\n",
"Notes: \\newline\n",
" [1] Standard Errors assume that the covariance matrix of the errors is correctly specified."
],
"text/plain": [
"\n",
"\"\"\"\n",
" OLS Regression Results \n",
"==============================================================================\n",
"Dep. Variable: rating R-squared: 0.070\n",
"Model: OLS Adj. R-squared: 0.064\n",
"No. Observations: 601 F-statistic: 11.27\n",
"Covariance Type: nonrobust Prob (F-statistic): 8.07e-09\n",
"========================================================================================\n",
" coef std err t P>|t| [0.025 0.975]\n",
"----------------------------------------------------------------------------------------\n",
"Intercept 4.0772 0.102 40.168 0.000 3.878 4.277\n",
"gender[T.male] 0.0093 0.087 0.106 0.916 -0.162 0.181\n",
"children[T.yes] -0.2094 0.118 -1.775 0.076 -0.441 0.022\n",
"scale(religiousness) 0.0900 0.045 2.017 0.044 0.002 0.178\n",
"scale(yearsmarried) -0.2336 0.054 -4.329 0.000 -0.340 -0.128\n",
"========================================================================================\n",
"\n",
"Notes:\n",
"[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n",
"\"\"\""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Your answer here\n",
"# Scale like so\n",
"model_two = smf.ols('rating ~ gender + children + scale(religiousness) + scale(yearsmarried)', data=affairs).fit()\n",
"display(model_two.summary(slim=True))"
]
},
{
"cell_type": "markdown",
"id": "04ec187b-f4da-48d8-abf6-9567f69951f6",
"metadata": {},
"source": [
"Now what do the coefficients mean? How have they changed compared to the last one? "
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "17b74f16-48a1-40c9-8a14-4b3fc9e9b331",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"# Your answer here\n",
"# Intercept - average satisfaction when all other predictors at zero - which means for females with no children still\n",
"# Gender - change in for males compared to females, holding all constant\n",
"# Children - change in satisfaction for someone with children compared to without, holding all constant\n",
"# Religiousness - Increase in satisfaction for a 1SD increase on religiosity, holding all constant\n",
"# Years Married - Increase in satisfaction a 1SD increase in years married (one year)"
]
},
{
"cell_type": "markdown",
"id": "fcf1ae39-ca34-4530-b7b9-902a9c3e94cc",
"metadata": {},
"source": [
"### d. Evaluating the model\n",
"The `.summary()` output suggests that while our model is statistically significant, it explains just 7% of the variance in satisfaction ratings. But we should not rely on this as a way to evaluate our model. Instead, we would like to know how *wrong our model is, on average*. And to know that, we have to compute the root-mean-squared-error, or RMSE.\n",
"\n",
"Check the notes for the import and code that will do this. Take your fitted models values, and use the correct function to calculate the RMSE. \n",
"\n",
"What is it? Is it any good? "
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "eacbfd95-5cb7-44ef-ae73-5277ac384455",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": [
"hide-input"
]
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1.0628082871251807\n"
]
}
],
"source": [
"# Your answer here\n",
"# Import rmse from statsmodels\n",
"from statsmodels.tools.eval_measures import rmse\n",
"\n",
"# Call it\n",
"accuracy = rmse(model_two.fittedvalues, affairs['rating'])\n",
"print(accuracy)"
]
},
{
"cell_type": "markdown",
"id": "34c3cb0d-1f9c-4c50-899f-774fbac4b74b",
"metadata": {},
"source": [
"Note that 'any good' depends entirely on your use case, though lower is generally better. Here, we're wrong by about 1 rating scale point. That might be acceptable in some scenarios, but here it suggests that this is not particularly useful! We might consider more predictors, or more complex models, but we will get to that in time."
]
},
{
"cell_type": "markdown",
"id": "e83a5eb5-c6e7-477c-9a70-bf793b387d35",
"metadata": {},
"source": [
"## 3. Comparing and contrasting analyses through the GLM\n",
"Finally, we will practice demonstrating comparisons between the analysis you performed earlier with `pingouin` and their GLM equivalents, just to drill home their similarities and to broaden your thinking about how models are used."
]
},
{
"cell_type": "markdown",
"id": "59791754-afd7-49d0-8542-b318bf151bca",
"metadata": {},
"source": [
"### a. Correlation\n",
"Earlier, you showed that the number of affairs and marital satisfaction were negatively correlated. Reproduce that analysis with a GLM in `statsmodels` below."
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "4ae2cac9-40a5-4ea0-a397-a0ab4673fa4c",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": [
"hide-input"
]
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"
OLS Regression Results
\n",
"
\n",
"
Dep. Variable:
scale(affairs)
R-squared:
0.078
\n",
"
\n",
"
\n",
"
Model:
OLS
Adj. R-squared:
0.077
\n",
"
\n",
"
\n",
"
No. Observations:
601
F-statistic:
50.76
\n",
"
\n",
"
\n",
"
Covariance Type:
nonrobust
Prob (F-statistic):
3.00e-12
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
coef
std err
t
P>|t|
[0.025
0.975]
\n",
"
\n",
"
\n",
"
Intercept
1.615e-16
0.039
4.12e-15
1.000
-0.077
0.077
\n",
"
\n",
"
\n",
"
scale(rating)
-0.2795
0.039
-7.125
0.000
-0.357
-0.202
\n",
"
\n",
"
Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified."
],
"text/latex": [
"\\begin{center}\n",
"\\begin{tabular}{lclc}\n",
"\\toprule\n",
"\\textbf{Dep. Variable:} & scale(affairs) & \\textbf{ R-squared: } & 0.078 \\\\\n",
"\\textbf{Model:} & OLS & \\textbf{ Adj. R-squared: } & 0.077 \\\\\n",
"\\textbf{No. Observations:} & 601 & \\textbf{ F-statistic: } & 50.76 \\\\\n",
"\\textbf{Covariance Type:} & nonrobust & \\textbf{ Prob (F-statistic):} & 3.00e-12 \\\\\n",
"\\bottomrule\n",
"\\end{tabular}\n",
"\\begin{tabular}{lcccccc}\n",
" & \\textbf{coef} & \\textbf{std err} & \\textbf{t} & \\textbf{P$> |$t$|$} & \\textbf{[0.025} & \\textbf{0.975]} \\\\\n",
"\\midrule\n",
"\\textbf{Intercept} & 1.615e-16 & 0.039 & 4.12e-15 & 1.000 & -0.077 & 0.077 \\\\\n",
"\\textbf{scale(rating)} & -0.2795 & 0.039 & -7.125 & 0.000 & -0.357 & -0.202 \\\\\n",
"\\bottomrule\n",
"\\end{tabular}\n",
"%\\caption{OLS Regression Results}\n",
"\\end{center}\n",
"\n",
"Notes: \\newline\n",
" [1] Standard Errors assume that the covariance matrix of the errors is correctly specified."
],
"text/plain": [
"\n",
"\"\"\"\n",
" OLS Regression Results \n",
"==============================================================================\n",
"Dep. Variable: scale(affairs) R-squared: 0.078\n",
"Model: OLS Adj. R-squared: 0.077\n",
"No. Observations: 601 F-statistic: 50.76\n",
"Covariance Type: nonrobust Prob (F-statistic): 3.00e-12\n",
"=================================================================================\n",
" coef std err t P>|t| [0.025 0.975]\n",
"---------------------------------------------------------------------------------\n",
"Intercept 1.615e-16 0.039 4.12e-15 1.000 -0.077 0.077\n",
"scale(rating) -0.2795 0.039 -7.125 0.000 -0.357 -0.202\n",
"=================================================================================\n",
"\n",
"Notes:\n",
"[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n",
"\"\"\""
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Your answer here\n",
"# It is the scaling that makes this analysis so\n",
"smf.ols('scale(affairs) ~ scale(rating)', data=affairs).fit().summary(slim=True)"
]
},
{
"cell_type": "markdown",
"id": "6e89d059-1ea9-4e3a-9233-21b149ce2e45",
"metadata": {},
"source": [
"### b. T-test\n",
"Can you reproduce the t-test as you did earlier, comparing satisfaction for those with and without children? Note - the t-value will not match exactly unless you turned off the 'correction' in the t-test. T-tests are for cowards. \n",
"\n",
"What do the coefficients mean here? Which category of the `children` variable became the intercept?"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "158599b7-355e-424d-94bc-d9f0c072a640",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": [
"hide-input"
]
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"
OLS Regression Results
\n",
"
\n",
"
Dep. Variable:
rating
R-squared:
0.039
\n",
"
\n",
"
\n",
"
Model:
OLS
Adj. R-squared:
0.037
\n",
"
\n",
"
\n",
"
No. Observations:
601
F-statistic:
24.00
\n",
"
\n",
"
\n",
"
Covariance Type:
nonrobust
Prob (F-statistic):
1.24e-06
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
coef
std err
t
P>|t|
[0.025
0.975]
\n",
"
\n",
"
\n",
"
Intercept
4.2749
0.083
51.635
0.000
4.112
4.437
\n",
"
\n",
"
\n",
"
children[T.yes]
-0.4795
0.098
-4.899
0.000
-0.672
-0.287
\n",
"
\n",
"
Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified."
],
"text/latex": [
"\\begin{center}\n",
"\\begin{tabular}{lclc}\n",
"\\toprule\n",
"\\textbf{Dep. Variable:} & rating & \\textbf{ R-squared: } & 0.039 \\\\\n",
"\\textbf{Model:} & OLS & \\textbf{ Adj. R-squared: } & 0.037 \\\\\n",
"\\textbf{No. Observations:} & 601 & \\textbf{ F-statistic: } & 24.00 \\\\\n",
"\\textbf{Covariance Type:} & nonrobust & \\textbf{ Prob (F-statistic):} & 1.24e-06 \\\\\n",
"\\bottomrule\n",
"\\end{tabular}\n",
"\\begin{tabular}{lcccccc}\n",
" & \\textbf{coef} & \\textbf{std err} & \\textbf{t} & \\textbf{P$> |$t$|$} & \\textbf{[0.025} & \\textbf{0.975]} \\\\\n",
"\\midrule\n",
"\\textbf{Intercept} & 4.2749 & 0.083 & 51.635 & 0.000 & 4.112 & 4.437 \\\\\n",
"\\textbf{children[T.yes]} & -0.4795 & 0.098 & -4.899 & 0.000 & -0.672 & -0.287 \\\\\n",
"\\bottomrule\n",
"\\end{tabular}\n",
"%\\caption{OLS Regression Results}\n",
"\\end{center}\n",
"\n",
"Notes: \\newline\n",
" [1] Standard Errors assume that the covariance matrix of the errors is correctly specified."
],
"text/plain": [
"\n",
"\"\"\"\n",
" OLS Regression Results \n",
"==============================================================================\n",
"Dep. Variable: rating R-squared: 0.039\n",
"Model: OLS Adj. R-squared: 0.037\n",
"No. Observations: 601 F-statistic: 24.00\n",
"Covariance Type: nonrobust Prob (F-statistic): 1.24e-06\n",
"===================================================================================\n",
" coef std err t P>|t| [0.025 0.975]\n",
"-----------------------------------------------------------------------------------\n",
"Intercept 4.2749 0.083 51.635 0.000 4.112 4.437\n",
"children[T.yes] -0.4795 0.098 -4.899 0.000 -0.672 -0.287\n",
"===================================================================================\n",
"\n",
"Notes:\n",
"[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n",
"\"\"\""
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Your answer here\n",
"# t-test is easy\n",
"smf.ols('rating ~ children', data=affairs).fit().summary(slim=True)"
]
},
{
"cell_type": "markdown",
"id": "972ed37b-9280-4c83-93ed-bc8ea502dbcb",
"metadata": {},
"source": [
"### c. ANCOVA\n",
"We will return to the link between ANOVA and GLM's once we expand our modelling knowledge. But for now, we can demonstrate the equivalence between *ANCOVA* and a linear model. An ANCOVA is essentially a set of categorical predictors with several continuous predictors alongside. In the above challenge, you used an ANCOVA to address whether males and females differed in the number of affairs, while controlling for the length of their marriage. Can you reproduce that in a linear model form below, and can you draw comparisons to the ANCOVA output and the linear model version?"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "9b472619-e57e-488c-9329-0e4cf2024b01",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": [
"hide-input"
]
},
"outputs": [
{
"data": {
"text/html": [
"